Summary Page


PROBLEM 

Human  performance  testing  results  in  scores  which  represent  the 
performance.  The  scores  indicate  differences  among  people,  alterations 
due  to  types  of  stimuli  within  a  test  (e.g.,  changing  signal  intensity), 
effects  of  changes  in  the  test  environment,  and,  if  the  tests  are  repeated, 
effects  of  practice.  Mathematical  descriptions  of  these  differences  and 
changes  are  compared  with  the  data  to  indicate  which  types  of  effects 
occurred.  The  mathematical  models  are  usually  statistical,  due  to  the 
variability  of  the  effects.  The  problem  is  that  the  utility  of  the  test 
scores  is  limited  by  the  generality  and  accuracy  of  the  statistical- 
mathematical  models  used  to  interpret  the  data. 


FINDINGS 


1.  Correlations  between  tests  at  each  stage  of  practice  can  be  useful 
to  show  changes  of  what  is  measured  by  the  tests. 

2.  There  are  many  ways  to  detect  changes  of  individual  differences 
during practice (e.g., chi-square statistics, graphical methods, factor
analysis,  analysis  of  variance).  None  of  the  techniques  studied  is  entirely 
satisfactory. 

3.  Signal  detection  theory  can  be  useful  for  analysis  of  performance 
tests  involving  comparisons  of  stimuli  with  a  standard  stimulus. 

4.  Time  series  analysis  can  be  used  to  explain  how  performance  changes 
over  time. 


RECOMMENDATIONS 

Human  performance  data  should  be  compared  with  some  sort  of  model  or 
hypothesis  about  effects  represented  by  the  data.  Several  useful  models 
are  presented,  and  their  application  is  recommended  in  appropriate  contexts. 


Trade names of materials or products of commercial or non-government
organizations  are  cited  where  essential  for  precision  in  describing  research 
procedures  or  evaluation  of  results.  Their  use  does  not  constitute  official 
endorsement  or  approval  of  the  use  of  such  commercial  hardware  or  software. 

This research work was funded by the Naval Medical Research and Development
Command and by the Biological Sciences Division of the Office of Naval
Research. 
Abstract

Video  Games 

In  1972  a  coin-operated  video  game  called  Pong  and  manufactured  by 
Atari,  Inc.,  a  company  founded  that  same  year,  appeared  on  the  electronic- 
games  market.  In  less  than  a  year  Atari  sold  6,000  games  at  more  than 
$1,000  apiece.  Midway  Manufacturing  Co.,  which  Atari  licensed  to  produce 
a version of Pong, sold 9,000 of the table-tennis-type games in less than
six  months. 

Also in 1972, Magnavox marketed a video game called Odyssey that
could  be  played  on  home  TV  sets.  The  Odyssey  set  included  a  control  unit, 
which  attached  to  a  home  TV  set  and  permitted  one  to  play  12  different 
games  by  inserting  a  "game  card"  into  the  control  unit.  The  original 
Odyssey  was  not,  however,  a  programmable  video  game.  All  12  games  were 
resident  in  the  control  unit  and  were  not,  in  fact,  very  different;  the 
"game  card"  set  appropriate  lines,  bars,  and  cursors.  Then  in  1975  Atari 
entered  the  home  video  market  with  a  version  of  Pong  that  offered  several 
new  advances:  electronically  generated  on-screen  courts,  sound  effects 
for  every  hit,  miss,  and  ricochet,  and  automatic  on-screen  digital  scoring. 
By  the  end  of  1976  twenty  different  companies,  including  Coleco,  First 
Dimension,  National  Semiconductor,  Phoenix,  Unisonic,  and  Universal 
Research, were producing video games for home use.

About this time, that is, late in 1976, Fairchild Camera and Instrument
entered the field with the first fully programmable video system. The
system  was  programmed  by  inserting  an  electronic  cartridge  into  the  game 
console.  The  benefit  was  that  one  could  play  as  many  different  games  as 
the  company  provided  cartridges  —  in  fact,  more,  because  most  cartridges 
contained  several  games.  Different  games  within  the  same  cartridge  were 
selected  by  punching  in  a  number  on  the  control  console. 

In  1977  and  1978  programmable  video  games  for  home  use  proliferated 
on  all  sides.  Companies  already  in  the  field,  like  Atari  and  Magnavox, 
came  out  with  programmable  video  systems;  and  new  companies  entered  the 
field,  for  example,  RCA  and  Bally,  the  pinball-machine  company.  In  1978 
American  shoppers  spent  more  than  200  million  dollars  on  programmable 
home  video  games  and  everything  pointed  toward  an  even  larger  market  in 
the  future. 
Video games as psychological tests

The  potential  of  programmable  video  games  for  psychological  testing 
is  large.  First,  the  new  games  involve  skills  and  lots  of  them.  Video 
games  are  tasks  and  playing  them  repeatedly  constitutes  so  many  trials  of 
practice.  The  more  a  person  plays  the  better  he  or  she  becomes,  especially 
in  the  beginning;  after  extended  practice,  the  gains  from  playing  yet 
another  game  are  small  or  non-existent.  Most  of  the  games,  moreover,  have 
a  high  ceiling,  so  high  that  few  people  come  close  to  reaching  it.  Second, 
the  new  games  are  wonderfully  self-motivating.  A  case  can  be  made  that 
for research purposes solid motivation is not all to the good. Insufficient
motivation, boredom, or wavering attention may be precisely what the
investigator wishes to study; and in such a case video games would not be
the  tasks  of  choice.  More  often,  however,  we  are  interested  in  skill 
acquisition,  learning  or  forgetting,  as  distinct  from  performance;  and 
where  we  are,  insufficient  or  wavering  motivation  is  quite  simply  a  source 
of  error.  Third  and  last,  most  video  games  are  highly  speeded.  In  fact, 
this  feature  of  the  games  may  account  for  much  of  their  appeal.  In 
considerable  measure  the  games  are  enjoyable  because  they  operate  at  more 
or  less  the  same  speeds  as  we  do,  that  is,  as  our  brains  do.  Their  being 
so  fast,  however,  may  permit  them  to  tap  aspects  of  human  functioning 
that  escaped  us  as  long  as  we  were  dealing  with  essentially  mechanical 
tasks (pursuit rotor, two-hand or complex coordination).

Programmable  video  games  are  equally  attractive  at  a  pragmatic  level, 
especially  for  performance  testing.  Literally  dozens  of  games  and,  in 
principle,  hundreds  or  even  thousands  can  be  played  with  identically  the 
same  equipment;  one  need  only  insert  another  cartridge.  Television  sets 
are  light,  easily  transported,  and  occupy  little  space.  Furthermore,  if 
they  break  down,  they  are  easily  replaced.  The  game  console  and  associated 
cartridges  are  robust.  The  only  parts  of  game  equipment  that  show  any 
appreciable tendency to break down are the joysticks, wheels, knobs, etc.
that  the  subject  manipulates;  but  these  too  are  easily  replaced. 

Stabilization  and  task  definition 

Despite  these  many  advantages  psychologists  have  not  rushed  to  study 
the new games or use them in prediction and performance testing. The
first  studies  of  programmable  video  games  from  a  psychological  standpoint 
were  begun  in  the  late  summer  of  1978  at  the  Navy  Aerospace  Medical 
Research  Laboratory  (NAMRL)  in  New  Orleans.  The  purpose  of  the  NAMRL 
studies  was  to  find  out  whether  or  not  the  video  games  were  suitable  for 
inclusion  in  a  performance  test  battery  for  environmental  research. 

A  prime  requirement  of  any  performance  test  is  that  it  stabilize.  In 
a  good  performance  test  there  comes  a  point  in  practice  after  which 
individual  performance  does  not  change  in  the  absence  of  external  changes. 
In  group  terms  the  mean  follows  a  flat  course,  the  variance  among  subjects 
remains  the  same  from  one  trial  to  the  next,  and  all  correlations  among 
stabilized  trials  are  equal  except  for  sampling  variations.  If  a  test 
satisfies  these  requirements,  it  may  be  used  to  study  the  impact  of 
environmental  variations  on  performance.  If  it  does  not,  it  is  at  best 
difficult  to  determine  whether  an  observed  change  in  performance  is  a 
practice  effect  or  the  result  of  environmental  changes.  An  additional 
requirement is that task definition (the average correlation among
stabilized  trials)  be  high,  preferably  greater  than  .90. 
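
The stabilization requirements and the definition of task definition given above lend themselves to a short computational sketch. The Python below is not taken from the NAMRL analyses; it assumes a hypothetical layout in which scores[s][d] holds the daily score of subject s on day d, and the function names are introduced here purely for illustration. Task definition is computed as the plain arithmetic mean of the correlations among all pairs of stabilized days, which is one reasonable reading of "average correlation."

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def daily_means_and_variances(scores):
    """Group mean and between-subject variance for each day of practice."""
    columns = list(zip(*scores))  # one tuple of subject scores per day
    return [mean(c) for c in columns], [variance(c) for c in columns]


def task_definition(scores, first_stable_day):
    """Average correlation among all pairs of stabilized days.

    first_stable_day is the 0-based index of the first day treated as
    stabilized (an assumption supplied by the analyst, not estimated here).
    """
    columns = list(zip(*scores))
    stable = range(first_stable_day, len(columns))
    return mean(pearson(columns[i], columns[j])
                for i, j in combinations(stable, 2))
```

For a test practiced for 15 days whose intertrial correlations stabilize after day 6, one would call task_definition(scores, 6), which averages the correlations among days 7 through 15. A flat course in the daily means and roughly constant daily variances would be checked separately from the output of daily_means_and_variances.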

In  the  New  Orleans  laboratory  a  large  number  of  conventional  tests 
and,  after  September,  1978,  video  games  have  been  studied  over  extended 
periods  of  practice,  15  consecutive  working  days,  with  a  view  to  finding 
out  how  quickly,  if  at  all,  they  stabilize  and  how  well  defined  they  are. 
So  far  nine  video  games  have  been  studied  in  small  samples  (roughly  13 
subjects) and one game, Air Combat Maneuvering (ACM), has been studied in
roughly  twice  that  number  of  subjects.  All  ten  video  games  are  made  by 
Atari. 

ACM is a remarkable task. The mean follows a classical learning
curve, rising rapidly in the early trials and then gradually flattening
out. The variance among subjects stabilizes after day 8 and the
intertrial correlation after day 6. Task definition is very high, .93. In
the  first  six  days  of  practice,  that  is,  prior  to  stabilization,  the 
intertrial  correlations  show  an  exceptionally  regular  superdiagonal  form. 
Altogether  ACM  not  only  meets  the  requirements  laid  down  for  it  as  a 
performance  test  but  does  so  more  fully  than  any  conventional  test,  with 
one exception, studied at NAMRL. The exception is Arithmetic, a conventional
test that seems to be stable from the outset; the reason, in all
probability, is that arithmetical skills have been so thoroughly
practiced in school and everyday life that the subjects come to the
laboratory at or near asymptotic levels.

Data concerning other video games studied at NAMRL are more preliminary.
It does seem, however, that some other games are as promising
for  performance  testing  as  ACM.  Breakout,  for  example,  seems  also  to 
stabilize  after  six  days,  though  with  poor  task  definition,  .77.  It  also 
seems  that  video  games  do  not  all  depend  on  the  same  underlying  skills  and 
abilities  since  the  correlations  between  tasks  are  in  some  cases  quite  low. 

Convergence-divergence  relations 

The  present  report  focuses  on  convergence-divergence  relations  among 
video  games.  When  a  task  is  practiced,  its  correlation  with  an  external 
measure  may  increase,  decrease,  or  remain  the  same,  to  take  linear 
possibilities  only  into  account.  If  the  correlation  increases,  the  task 
is  said  to  converge  on  the  external  measure;  if  it  decreases,  the  task 
diverges  from  the  external  measure. 

Table  1  presents  the  cross-correlations  between  ACM  and  Breakout  in 
13  Navy  enlisted  volunteers.  Each  subject  played  10  games  of  ACM  a  day 
for  15  consecutive  working  days,  followed  by  10  games  of  Breakout  a  day 
for another 15 consecutive working days. His score each day was the
average of the 10 games played.

Now consider the row averages. These figures represent the correlation
of each of the 15 days on ACM with the 15 days on Breakout
considered as a whole. Testing for linear trend in a two-way analysis of
variance, using the interaction between rows and columns as the error
term, shows a small but significant tendency (p < .01) for ACM to converge
on Breakout. The regression line rises by .07 from day 1 to day 15.
Breakout,  on  the  other  hand,  converges  strongly  on  ACM.  The  regression 
line  for  the  column  averages  rises  by  .33  from  day  1  to  day  15. 
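
The row and column analysis just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure actually used: task_a[s][d] and task_b[s][d] are hypothetical arrays of daily scores for the same subjects, correlations are averaged arithmetically, and the "rise" is taken as the least-squares slope multiplied by the number of day-to-day steps. The analysis-of-variance significance test is not reproduced.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cross_correlations(task_a, task_b):
    """R[i][j]: correlation across subjects of day i on A with day j on B."""
    days_a = list(zip(*task_a))
    days_b = list(zip(*task_b))
    return [[pearson(a, b) for b in days_b] for a in days_a]


def linear_rise(values):
    """Fitted change from the first to the last day (slope times n - 1)."""
    n = len(values)
    days = list(range(1, n + 1))
    md, mv = mean(days), mean(values)
    slope = (sum((d - md) * (v - mv) for d, v in zip(days, values))
             / sum((d - md) ** 2 for d in days))
    return slope * (n - 1)


def convergence(task_a, task_b):
    """Return (rise of row averages, rise of column averages).

    A positive first value means A converges on B with practice on A;
    a positive second value means B converges on A with practice on B.
    """
    r = cross_correlations(task_a, task_b)
    row_averages = [mean(row) for row in r]
    column_averages = [mean(col) for col in zip(*r)]
    return linear_rise(row_averages), linear_rise(column_averages)
```

Because the two returned values are computed from different margins of the same matrix, nothing forces them to agree in sign or size, which is the asymmetry the text goes on to emphasize.
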
Two points are worth underscoring. First, convergence-divergence
relations  are  not  symmetrical.  Because  task  A  converges  on  task  B  it 
does  not  follow  that  task  B  converges  on  task  A;  task  B  may,  in  fact, 
diverge  from  task  A.  Second,  Breakout  followed  ACM  in  time.  Therefore, 
the correlations between Breakout and ACM increased with increasing temporal
separation. Day 1 on Breakout followed ACM directly while day 15
came  almost  three  weeks  later.  Nevertheless,  the  correlations  with  ACM 
increased  systematically  over  this  interval.  This  result  is  without 
precedent  in  the  literature  of  differential  psychology;  in  all  other 
studies  the  correlation  between  the  same  or  similar  measures  either 
decreases  with  increasing  temporal  separation  or  remains  the  same. 

ACM  and  Breakout  were  the  first  two  in  a  series  of  five  video  tasks; 
the  other  tasks  were,  in  order,  Surround,  Race  Car,  and  Slalom.  The  same 
13  subjects  practiced  all  five  tasks.  Breakout  converges  strongly  not 
only  on  ACM  but  on  the  other  three  tasks  as  well;  linear  change  over 
the  15  days  is  roughly  the  same  in  all  four  cases,  on  the  order  of  .30. 
ACM,  however,  shows  no  change  with  Surround,  a  slight  but  significant 
divergence  from  Race  Car  and  a  stronger  divergence  from  Slalom.  The 
linear  decrease  from  day  1  to  day  15  is  .06  for  Race  Car  and  .13  for 
Slalom.  The  last  two  cases  are  the  obverse  of  the  relations  between  ACM 
and  Breakout.  ACM  precedes  Race  Car  and  Slalom.  Therefore,  since  it 
diverges  from  these  two  tasks,  the  correlation  between  ACM  and  Race  Car  or 
Slalom  decreases  as  ACM  gets  closer  and  closer  temporally  and  sequentially 
to  the  two  following  tasks.  These  results  are  also  without  precedent  in 
the  differential  literature. 

Application  to  pilot  selection 

A  test  converges  on  or  diverges  from  a  training  criterion  according 
as  the  correlation  between  test  and  criterion  increases  or  decreases  with 
practice  on  the  test.  If  the  test  diverges,  there  is  plainly  no  point 
in  extending  practice  on  the  test  since  the  effect  is  to  lower  predictive 
validity;  if  it  converges,  however,  there  may  be  no  predictive  validity 
at  all  without  extended  practice. 

Pilot  training  takes  place  in  a  series  of  stages,  each  one  (except 
the first) building on at least some of the preceding stages. It is
possible,  therefore,  to  speak  not  only  of  a  test  converging  on  or  diverging 
from  the  criterion  but  also  of  the  criterion  converging  on  or  diverging 
from  a  test.  If  the  correlation  between  flight  grades,  for  example,  and 
a  test  increases  with  level  of  training,  the  criterion  converges  on  the 
test.  If  the  correlation  decreases  as  students  progress  to  more  and  more 
advanced  stages,  the  criterion  diverges  from  the  test.  In  the  first  case, 
where  training  criteria  converge  on  a  test,  we  have  reason  to  believe  that 
the  test  will  predict  operational  performance  at  least  as  well  as  it  does 
performance  in  training.  If  the  training  criterion  diverges  from  a  test, 
however,  the  test  may  easily  be  valid  in  training  but  much  less  so  or  not 
at  all  in  operations. 
TABLE 1

Cross-correlations between Air Combat Maneuvering (ACM) and Breakout in 13 Navy volunteers

Note: Decimal points omitted.
CONVERGENCE-DIVERGENCE WITH EXTENDED PRACTICE: THREE APPLICATIONS
Marshall B. Jones
ABSTRACT

When  a  task  is  practiced,  its  correlation  with  an  external  measure  may 
increase,  decrease,  or  remain  the  same,  to  take  only  linear  possibilities  into 
account.  If  the  correlation  increases,  the  task  is  said  to  converge  on  the 
external  measure;  if  it  decreases,  the  task  diverges  from  the  external  measure. 
This  simple  notion  has  many  applications,  some  of  them  entailing  important 
theoretical  consequences.  The  present  paper  discusses  three  of  these  applications. 


INTRODUCTION 

The first author to recognize that practicing a task might alter its
correlations with other measures was Herbert Woodrow (1939). His main
finding was that the correlation between tasks tended to weaken with
practice on one of them. For 15 years Woodrow's studies along this line
were not pursued by other workers. Then, in the early 1950's, a series of
investigations under Air Force auspices (Adams, 1953; Fleishman and
Hempel, 1953; Reynolds, 1952) showed beyond any serious question that the
correlations of many tasks with other measures change with practice. In
general, Woodrow's early generalization held up, that is, most tasks were
more strongly correlated with external measures early in practice than
they were later on; but there were many exceptions. Depending on the
particular task that was practiced and the particular external measure,
the correlation between the two might increase, decrease, or remain the
same as the task was practiced.

Recently, Jones, Kennedy, and Bittner (1980) introduced the phrase
"convergence with practice" to indicate increasing correlations between a
task and an external measure; similarly, "divergence" means that the
correlation between a task and an external measure decreases with
practice. The present paper develops this idea in three different
settings: differential retention over long periods of no practice,
personnel selection and classification, and the identification of latent
factors.

DIFFERENTIAL  RETENTION 

Under constant conditions, most individuals reach a point on most tasks
where they are no longer improving with practice or are improving only at
a slow and regular rate; at this point each individual is at or near his
or her asymptotic level. An array of such levels is called a terminal
process (Jones, 1970a&b). At earlier points in practice an individual's
level of performance can be analyzed into two parts, one reflecting the
terminal process and the other individual differences in approach to
terminal levels. Jones (1970a&b) calls this second part the rate process.
Suppose now that practice stops with all subjects at or near terminal
levels and is followed by a long period (several months at a minimum) of
no practice. When practice is resumed, many individuals will no longer be
performing at terminal levels. As retraining proceeds, however, the
subjects should, according to Jones' two-process theory of individual
differences in skill acquisition, return to their original terminal
levels. In consequence, the correlation between terminal level in
original practice and performance in retraining should increase with
retraining session number. Put differently, the retraining sessions
should converge on terminal levels in original learning.

This consequence, it should be pointed out, has no precedent in
correlations among temporally ordered measures of the same sort. The
well-nigh universal rule in matrices of this description is that
correlation decreases with temporal separation (Jones, 1969, 1972). Our
consequence, however, calls for original learning to correlate most
strongly with the temporally most removed measure, that is, the last
retraining session, and least strongly with the measure closest to it in
time, that is, the first retraining session.

A study to test this reasoning is currently underway at the Navy
Biodynamics Laboratory in New Orleans. The design calls for 24 subjects,
six tasks (all video games manufactured by Atari Inc.), and three
retention intervals (4-6 months, 10-12 months, and 16-18 months). We will
consider one of these tasks, Air Combat Maneuvering (ACM), in some detail.
All 24 subjects practiced ACM ten games a day for 15 consecutive working
days, and the ten games a subject played each day were averaged to obtain
a single data point for each individual on each day; the retention
interval is 16-18 months.

Air Combat Maneuvering is a remarkable task (Jones, 1979). The mean
follows a classical learning curve, rising rapidly in the early trials and
then gradually flattening out; the variance among subjects stabilizes
after day 8. The 36 correlations among days 7 through 15 are high,
averaging r = .93, and differ from one another no more than one would
expect from sampling considerations (Lawley's chi-squared test). In short,
from day 7 on the subjects are all at or near their terminal levels,
except for small amounts of random error. The average, therefore, of a
subject's scores on these nine days is a close estimate of that
individual's terminal level.

A half dozen of these subjects have been returned to practice for five
consecutive working days after 16-18 months of no practice. The question
at issue is the correlation between terminal level, as estimated from the
average of days 7 through 15, and each of the five days of retraining. If
Jones' theory is correct, this correlation will be lowest on day 1 of
retraining and highest on day 5.
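
The comparison at issue can be written out as a brief Python sketch. It is offered only as an illustration: original[s][d] and retraining[s][d] are hypothetical arrays of daily scores for the same subjects, and the function names are not drawn from the study. Terminal level is estimated as the mean of days 7 through 15 of original practice and is then correlated with each retraining day in turn.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def terminal_levels(original, first_day=7, last_day=15):
    """Each subject's mean score over the stabilized days (1-based, inclusive)."""
    return [mean(row[first_day - 1:last_day]) for row in original]


def retention_correlations(original, retraining):
    """Correlation of terminal level with each successive retraining day."""
    levels = terminal_levels(original)
    return [pearson(levels, day) for day in zip(*retraining)]
```

On the two-process account, the list returned by retention_correlations should increase from its first entry to its last. With only a half dozen subjects, each individual correlation is of course subject to large sampling error.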

PERSONNEL  SELECTION 

The second application concerns personnel selection in cases where "the
criterion" develops in a series of stages or phases. Pilot training, for
example, takes place in stages (pre-solo, precision, acrobatics, etc.),
and each student who completes training receives a flight grade for each
stage. In such a case we have two kinds of convergence-divergence to
consider. First, does a predictor task converge on or diverge from the
flight training criterion, taken, let us say, as the average flight grade
in advanced training? If it converges on the criterion, then predictive
validity increases with practice. If the task diverges from the criterion
with practice, then predictive validity decreases with practice.

In the latter case there is no point in extending practice on the
predictor task; in the former case there may be, especially if the
increase in predictive validity is sizable.

Flight grades may also converge on or diverge from the predictor task,
taken, let us say, as terminal levels of performance. This is the second
kind of convergence-divergence in the selection context. If flight grades
diverge from the predictor task as a student progresses to more and more
advanced stages of training, the task may easily be valid in training but
much less so or not at all in operations. If, on the other hand, flight
grades converge on the predictor task, there is reason to believe that the
test will predict operational performance at least as well as it does
performance in training.

Predictor-criterion relations are seriously oversimplified when presented
in a static point-to-point way. In many selection programs both the
predictor and the criterion change systematically with practice or
training; where they do, it is crucial to know not just the overall
magnitude of predictor-criterion relations but how these relations change
with stage of training or practice on the predictor task.


FACTOR  IDENTIFICATION 

The third application concerns factor-referenced tests. It has been known
for more than 20 years that the factorial content of a skilled task
changes with practice (Fleishman, 1960; Fleishman and Hempel, 1953).
Twenty years ago, however, it was customary to draw a sharp distinction
between skills and abilities. Skills were tasks of practical relevance and
they were practiced. Abilities were measured by "reference tests" which
were administered for short periods of time only and usually had little or
no practical importance. Skills were narrow in scope, whereas abilities
were broad and enduring. To a large extent, this distinction was always
arbitrary; it was only a convention that said skilled tasks could be
practiced whereas reference tests could (or should) be administered to the
same subjects once or twice only. As long as it lasted, however, the
distinction served to contain and limit Fleishman's findings about
differential changes with practice. One thought of skills as changing with
practice; nothing was said about abilities.

In recent years the idea of abilities as broad, measurable, enduring
traits has been called into question along several lines (Alvares and
Hulin, 1972, 1973; Mischel, 1968; Humphreys, 1968). One such line is to
treat tests of ability like other tasks, that is, to allow practice. Some
factor-referenced tests can be practiced as they stand; others require
multiple parallel forms. In any case, when practice is allowed, tests of
ability behave just like other tasks. People improve with practice,
usually in a negatively accelerated way; correlations between trials of
practice follow the usual, so-called superdiagonal pattern (Jones, 1969,
1972), with intertrial correlations decreasing with increasing serial or
temporal separation; finally, as practice proceeds, tests converge on or
diverge from some external measures, including other tests.

Code substitution, or digit-symbol as it is also called, is a good
example. The test is generated by pairing the nine digits (1, 2, 3, ...,
9) with nine arbitrarily chosen letters. The letters are then presented in
a random series numbering, perhaps, 100 to 200 letters in all, and the
subject is required to write down the paired digit for as many letters as
possible in the time allowed. The usual measure is number correct or, if
the series is short relative to time allowed, time to finish.

In one form or another this test has been included in intelligence tests
since the First World War. It was part of the Army Alpha and Beta tests
(Pintner, 1923) and the original version of the Wechsler-Bellevue
Intelligence Scale (Wechsler, 1958). The Differential Aptitude Tests,
General Aptitude Test Battery, and most tests of clerical aptitude include
some form of the code substitution test (Buros, 1972).
Recently, Pepper, Kennedy, and Bittner (1980) administered alternate forms
of the code substitution test to 19 Navy enlisted men for 15 consecutive
working days. Each alternate form consisted of 135 letters with its own
randomly chosen letter-digit pairing. The subjects were instructed to
write down the paired digit beneath each letter and given two minutes to
complete as many pairs as they could. The measure of performance was total
number correct.
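
The construction of an alternate form, as described above, can be sketched in Python. This is a hypothetical reconstruction, not the procedure Pepper, Kennedy, and Bittner actually used: the function names, the use of the 26 upper-case letters as the pool, and the seeded random generator are all assumptions made for illustration. The two-minute time limit is not modeled; items the subject did not reach are simply absent from the response list.

```python
import random
import string


def make_form(n_items=135, seed=None):
    """Build one alternate form: a letter-digit key and a random letter series."""
    rng = random.Random(seed)
    # Nine arbitrarily chosen letters, paired in random order with digits 1-9.
    letters = rng.sample(string.ascii_uppercase, 9)
    key = dict(zip(letters, range(1, 10)))
    series = [rng.choice(letters) for _ in range(n_items)]
    return key, series


def number_correct(key, series, responses):
    """Count responses matching the keyed digit, in order of presentation.

    zip() stops at the shorter sequence, so unreached items score nothing.
    """
    return sum(1 for letter, digit in zip(series, responses)
               if key[letter] == digit)
```

Calling make_form with a different seed for each day would yield 15 alternate forms, each with its own pairing; number_correct then gives the performance measure for a subject's written responses on that form.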

Mean performance on this test increased regularly from 68.3 correct on the
first day to 80.2 correct on the 15th day; the variance among subjects
stabilized after day 7. Intertrial (interday) correlations showed
unmistakable evidence of differential change for the first five days but
little or none thereafter. The average correlation among the last ten
days, however, was not as high as one might wish, .72 (Jones, 1979).

This study was carried out at the Navy Biodynamics Laboratory and many of
the same subjects were also given other tasks, for example, Air Combat
Maneuvering. The correlations of code substitution with several of these
other tasks changed systematically with practice, in some cases increasing
and in others decreasing.

In short, code substitution behaves just like other tasks: when it is
practiced, its differential content changes; and what is true of code
substitution is probably true of most tests, including tests used to
identify latent factors. This fact poses serious problems for factor
analysis.

Suppose, for example, that code substitution loads heavily on factor A
when it is little practiced but only very modestly when it is well
practiced. Given the first result, the usual conclusion would be that
factor A had something to do with clerical aptitude, speed, or accuracy.
But if that is true, then why doesn't code substitution load heavily on
factor A when it is well practiced? If factor loadings change with
practice (and this much is foregone, given that tests converge on or
diverge from one another with practice), then how are ability factors to
be named? The same test that loads heavily on a factor at one stage of
practice may not do so at another; yet the content of the test, its
behavioral requirements, remains the same.

One way out of this dilemma is to equate test content with terminal levels
of performance. On this view, early stages of skill acquisition reflect
previous experience, differences in exposure, variations in learning
style, etc.; it is only late in practice, when subjects approach
asymptotic performance, that one can say, "these differences reflect test
content." This view also entails difficulties, however. Taken seriously,
it means that most factor-analytic attempts to identify underlying
abilities are improperly done since very few of them involve extended
practice on any task.
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ABSTRACT 

t.  This  paper  evaluates  three  methods  for  assessing  "differential  stability"  These 
methods  are  Craphical  Analysis,  Early  vs.  Late  Correlational  ANOVA,  and  the  Lawley  Test 
of  Correlational  Equality.  It  is  recommended  that  Craphical  Analysis  be  the  method  of 
first  choice  with  the  Early  vs.  Late  method  utilized  only  where  there  is  a  need  for 
formal  confirmation. 


INTRODUCTION 

Background 

Development of Performance Evaluation Tests for Environmental Research (PETER) is currently taking place at a number of government, university and industrial facilities. PETER is a human performance task battery which is being specifically designed for repeated administration in exotic environments. Focus on repeated administrations was motivated by recognition that the most frequently and almost exclusively used paradigm in environmental research uses repeated measurements of subjects. With and without control groups, this paradigm typically employs measurements of subjects in "before", "during" and "after" exposure conditions. Suitability of tasks for repeated administrations is a unique focus of PETER not considered in previous battery developments (Kennedy & Bittner, 1978; Kennedy, Bittner & Harbeson, 1979).

The repeated measures analysis of variance (ANOVA), almost universally applied to environmental paradigm data, puts stringent requirements on tasks for use in PETER. One of the requirements of such an ANOVA is "compound symmetry" of the variance-covariance matrix (Winer, 1962). This requirement can be shown (Anderson, 1968) to be equivalent to requiring: (a) homogeneity of variances across conditions and (b) differential stability, i.e., the correlations between trials must be constant ($\rho_{ij} = \rho$, $i \neq j$). Usually the first of these requirements, homogeneity of variance, can be met by either differential weighting of observations or transformations (Scheffé, 1959). Hence differential stability is the critical assumption for conventional analysis of environmental paradigm data.*
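The two requirements can be made concrete with a small sketch. The helper names, the tolerance, and the example values below are illustrative assumptions, not part of the PETER procedures; the code only shows what a compound-symmetric variance-covariance matrix looks like and how one might check a matrix for that pattern.

```python
def compound_symmetric(variance, rho, n_trials):
    """Build a covariance matrix with equal variances and equal correlations."""
    return [
        [variance if i == j else rho * variance for j in range(n_trials)]
        for i in range(n_trials)
    ]


def is_compound_symmetric(cov, tol=1e-9):
    """True when all variances agree and all covariances agree (within tol)."""
    n = len(cov)
    variances = [cov[i][i] for i in range(n)]
    covariances = [cov[i][j] for i in range(n) for j in range(n) if i != j]
    return (max(variances) - min(variances) <= tol
            and max(covariances) - min(covariances) <= tol)


# Hypothetical example: three trials, variance 4.0, common correlation 0.5.
stable = compound_symmetric(4.0, 0.5, 3)
# Same variances, but the trial 1 vs. trial 3 covariance has dropped.
unstable = [[4.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 4.0]]
```

In the second matrix the variances are homogeneous, so requirement (a) holds, yet the correlations differ between trial pairs; this is the differentially unstable case.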

Differential stability, in light of its experimental importance, has been surprisingly little studied. However, a few researchers (e.g., Fleishman, 1967) have shown instability for some tasks by demonstrating systematic variations in correlations between a reference battery and trial-to-trial performance on a task. In addition, the decline in between-trial correlations (sometimes to zero) with increasing trial separation has been noted by Jones (cf. 1962 and 1972) and followers (Kennedy & Bittner, 1978b) to suggest differential instability for almost all tasks without extensive practice. Kennedy and Bittner (1978b), in their study of potential tasks for PETER, have noted differential instability even where means and standard deviations have "plateaued" and most experimenters would assume sufficient "stability" for conduct of research. More recent PETER investigations have also found many tasks differentially "unstable" after practice ordinarily thought sufficient for their experimental utilization (Kennedy, Bittner & Harbeson, 1979). Clearly, there is need of methods for assessing if and when tasks attain differential stability.

Purpose 

The purpose of this report will be to evaluate three methods of assessing differential stability.

TESTS  OF  STABILITY 

Three tests of differential stability will be described below: (1) Graphical; (2) Early versus Late Correlational ANOVA; and (3) Lawley (1963) Test of Correlation Equality. Each of these tests will be illustrated by using the between-trial correlations obtained from thirteen subjects who practiced a video game, ATARI Air Combat Maneuvering, for 10 trials a day over 15 days. Table 1 gives the ATARI correlation matrix, which has been described elsewhere (Jones, Kennedy & Bittner, 1979).

*Multivariate profile analysis of basic environmental paradigm data can be conducted, despite the lack of differential stability (or homogeneity of variances), if the number of subjects exceeds the number of trials. Lack of differential stability, however, implies that the character of what is being measured is not constant over trials. Hence, while statistically valid, multivariate analysis may yield results meaningless from a scientific standpoint.

Graphical  Analysis 

Studies of differential stability by Graphical Analysis have been reported by Kennedy and Bittner (1978a, b) and Kennedy et al. (1979). While not yielding a strictly statistical test, Graphical Analysis permits visual understanding of task progression toward, and attainment of, stability, if present. Consider Figure 1, which portrays the correlations between selected base days (1, 2, 4, 6, 10 & 12) and those which follow. It was constructed by selecting a row of Table 1 corresponding to a base day of interest (e.g., Day 2) and plotting the correlations to the right of the diagonal in terms of "Days After Base Day (DABD)", e.g., $r_{2,3} = .92$ at 1 DABD, $r_{2,4} = .87$ at 2 DABD. Differential stability can be determined from traces such as those portrayed in Figure 1 by noting where the slopes of later Base Day traces approximate zero and overlay one another. A zero (flat) slope, it is noteworthy, indicates that correlations are stable in value, and the overlay of traces indicates that correlations are equal across Base Days. Examining Figure 1, traces for Base Days 1, 2 and 4 are seen to lie below a cluster of later Days and to have apparently negative slopes. Traces for Base Days 6 and later, however, appear to effectively overlay one another and have zero slopes. From Graphical Analysis, therefore, it appears that differential stability has been obtained on the ATARI task by the sixth day of practice.
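The construction of a Base Day trace can be sketched in a few lines. The function names and the 4 x 4 matrix are hypothetical; only the two Day 2 values (.92 at 1 DABD and .87 at 2 DABD) are taken from the text. The least-squares slope is offered as one possible numerical companion to the visual judgement, not as part of the published method.

```python
def base_day_trace(corr, base_day):
    """Return [(days_after_base_day, r), ...] for a 1-indexed base day.

    Reads the row of the correlation matrix for the base day and keeps
    only the entries to the right of the diagonal.
    """
    row = corr[base_day - 1]
    return [(later - base_day, row[later - 1])
            for later in range(base_day + 1, len(corr) + 1)]


def trace_slope(trace):
    """Ordinary least-squares slope of correlation against DABD."""
    xs = [x for x, _ in trace]
    ys = [y for _, y in trace]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in trace)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical 4-day matrix; only the Day 2 entries .92 and .87 echo the text.
corr = [
    [1.00, 0.80, 0.75, 0.70],
    [0.80, 1.00, 0.92, 0.87],
    [0.75, 0.92, 1.00, 0.90],
    [0.70, 0.87, 0.90, 1.00],
]
day2 = base_day_trace(corr, 2)
day2_slope = trace_slope(day2)
```

A trace whose slope is near zero, and which coincides with the traces of later base days, corresponds to the stable pattern described above.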

Early vs Late Days Correlation ANOVA

Jones (1979) has defined and applied this method of stability analysis. Following Jones, it can be argued that if stabilization occurs, the practice days can be divided into an "earlier" and a "later" segment such that: (a) the correlations between all of the later days and any one of the earlier days are constant and (b) the correlation between any two later days is the same. This Early vs. Late days division, Jones (1979) observes, can be seen in examination of a table of cross-correlations and subjected to ANOVA.

Jones' (1979) method of analysis can be illustrated with the ATARI data given in Table 1. Consider Table 2, which presents the correlations between the first six (tentatively "early") and the last nine (tentatively "late") days. The rows, subject to sampling variation, appear to meet the first (a) of Jones' conditions for stability, with relative consistency going across any row. The average correlations for the columns support the second (b) of Jones' conditions, as there appears to be no change at all from Day 7 to Day 15. In other words, Day 7 correlates no more strongly with the first six days than Day 15 does. It appears, therefore, that the ACM task is completely stabilized after Day 6. Table 3 summarizes the results of a two-way analysis of variance (ANOVA) carried out on the correlations in Table 2. Only the linear columns component is of interest because it reflects the flatness of early correlations with later days. Being
nonsignificant  (F“1.0,  pX6),  the  tentative 
interpretation of stability of correlation after Day 6 is confirmed.**
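The linear columns component can be sketched as a single-degree-of-freedom trend contrast on the column means of an early-by-late table. This is a minimal illustration under stated assumptions: the function name is hypothetical, the error term is taken to be the rows-by-columns residual of a two-way layout without replication, and the tables are made-up values rather than the entries of Table 2.

```python
def linear_column_trend(table):
    """Linear trend across columns of an early (rows) x late (columns) table.

    Returns (ss_linear, ms_residual, f_ratio). The residual is the
    rows x columns interaction of a two-way layout without replication
    (an assumption about the error term).
    """
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]

    # Centred, equally spaced contrast coefficients for the late days.
    centre = (n_cols - 1) / 2
    coeffs = [j - centre for j in range(n_cols)]
    contrast = sum(c * m for c, m in zip(coeffs, col_means))
    ss_linear = n_rows * contrast ** 2 / sum(c * c for c in coeffs)

    ss_residual = sum(
        (table[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n_rows) for j in range(n_cols)
    )
    ms_residual = ss_residual / ((n_rows - 1) * (n_cols - 1))
    f_ratio = ss_linear / ms_residual if ms_residual > 0 else float("inf")
    return ss_linear, ms_residual, f_ratio


# Hypothetical tables: rows are early days, columns are later days.
flat = [[0.8, 0.7, 0.9], [0.6, 0.7, 0.5]]        # no drift in column means
drifting = [[0.9, 0.8, 0.7], [0.7, 0.6, 0.5]]    # steady decline over columns
flat_ss, flat_ms, flat_f = linear_column_trend(flat)
drift_ss, _, _ = linear_column_trend(drifting)
```

A small F ratio for the linear component, as in the flat table, is the pattern read as stability; a steady decline across later days inflates the linear sum of squares.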

Lawley Test of Correlation Equality

Lawley (1963) has proposed a test for the equality of all correlations in a matrix, i.e., $H_0\colon \rho_{ij} = \rho$ ($i \neq j$). His test is an approximation of a likelihood-ratio test and rests on the assumption that the underlying distribution of observations is multivariate normal. Lawley's test statistic (Morrison, 1967) can be written

$$\chi^2 = \frac{N-1}{\hat\lambda^2}\left[\sum_{i<j}(r_{ij}-\bar r)^2 - \hat\mu\sum_{k=1}^{p}(\bar r_k - \bar r)^2\right]$$

where for p variates and N subjects

$$\bar r_k = \frac{1}{p-1}\sum_{i \neq k} r_{ik}, \qquad \bar r = \frac{2}{p(p-1)}\sum_{i<j} r_{ij},$$

$$\hat\lambda = 1 - \bar r, \qquad \hat\mu = \frac{(p-1)^2(1-\hat\lambda^2)}{p-(p-2)\hat\lambda^2}.$$

Under the assumption $H_0$, Lawley (1963) has shown that asymptotically his test statistic is chi-squared distributed with $df = \tfrac{1}{2}(p+1)(p-2)$ degrees of freedom. Applying the Lawley statistic to the 36 correlations among days 7 through 15 of Table 1, it can be found that the chi-squared is 39.82, which for 35 degrees
of  freedom  is  nonsignificant  ( p^ .75).  Hence 
the conclusion of differential stability subsequent to the sixth day is again confirmed.
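The computation can be sketched as follows, using the form of Lawley's statistic as it is commonly stated after Morrison (1967). The function name and the example matrix are assumptions for illustration; the ATARI correlations of Table 1 are not reproduced here, so the example uses a made-up equicorrelation matrix, for which the statistic should vanish.

```python
def lawley_statistic(corr, n_subjects):
    """Lawley's chi-squared for H0: all off-diagonal correlations equal.

    corr is a p x p correlation matrix (list of lists); returns
    (chi_squared, degrees_of_freedom). Undefined when the mean
    correlation equals 1.
    """
    p = len(corr)
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    r_bar = sum(corr[i][j] for i, j in pairs) / len(pairs)
    # Mean correlation of each variate with the remaining p - 1 variates.
    r_bar_k = [sum(corr[k][i] for i in range(p) if i != k) / (p - 1)
               for k in range(p)]

    lam_sq = (1 - r_bar) ** 2
    mu = (p - 1) ** 2 * (1 - lam_sq) / (p - (p - 2) * lam_sq)

    ss_pairs = sum((corr[i][j] - r_bar) ** 2 for i, j in pairs)
    ss_variates = sum((rk - r_bar) ** 2 for rk in r_bar_k)
    chi_sq = (n_subjects - 1) / lam_sq * (ss_pairs - mu * ss_variates)
    df = (p + 1) * (p - 2) // 2
    return chi_sq, df


# Hypothetical 9-day block with every between-day correlation equal to .90.
equal = [[1.0 if i == j else 0.9 for j in range(9)] for i in range(9)]
chi_sq, df = lawley_statistic(equal, n_subjects=13)
```

For nine days the degrees of freedom come to 35, matching the value quoted for days 7 through 15. Converting the statistic to a significance level requires a chi-squared table or a statistics library, which is left out of this sketch.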


**Nonlinear column effects, it is noteworthy, are not of interest as they largely reflect nonsystematic sampling variations. In this case, the spuriously low average correlation on Day 10, a Friday, is largely responsible for the nonlinear effect.


DISCUSSION



Each of the three stability tests was found to indicate differential stability for the ATARI ACM task by Day 7. This consensus, however, masked important differences between the three methods. These differences will be described below and recommendations will be made for statistical method selection.

Test  Differences 


Recently, Jones (1979) has pointed out that the Early vs Late Days and Lawley methods examine stability differently. The Lawley test has its focus on the equality of all correlations within a series of consecutive trials. Therefore, it can be expected to be sensitive to local deviations in correlations, reflecting accidental disruptions in performance (e.g., an unscheduled break during testing for some subjects) more than changes in differential stability. The Early vs Late test, in contrast, has its focus on systematic (linear) changes in average correlations with an external criterion (early trials). Local instabilities, affecting the Lawley, would be expected to have little impact on the Early vs Late method. Jones (1979) has defined the stability measured by the Lawley as local and that by the Early vs Late as general. In light of Jones' distinction, Graphical Analysis can be seen to focus on "general" stability, paralleling the Early vs Late method.

Each of the three stability methods can be distinguished by "cautions" for the potential user. Graphical Analysis, in particular, is not, strictly speaking, an objective statistical technique. It does not yield an alpha level or other numerical assessment. Interpretation of graphical traces requires a "knowledgeable eye", and disagreements between analysts, although infrequent, are possible.

The Early vs Late Days Correlational ANOVA is more objective than the Graphical technique, but the Early vs Late Days test statistic may yield significance levels which are substantially in error. An argument to show this possibility can be made from the observation that elements in estimated covariance matrices have correlated errors (cf. Anderson, 1958). Consequently, correlations estimated from covariance matrix elements will also have correlated errors, errors which might be expected to impact significance levels at a substantial level if experience with lag correlated errors is any indication (Scheffé, 1959, Chap. X). It can be noted, however, that analysis has suggested that the impact of correlated errors for matrices arising from reliability studies will be to inflate the apparent significance level. Hence a nonsignificant (linear column) result for the Early vs Late Days Analysis would support the view that a task is stable.


The Lawley Test, as with both the other methods described above, must be used cautiously by researchers. It is based on an assumption of multivariate normality which, if violated, could yield grossly inappropriate estimates of alpha level. An argument for this sensitivity to nonnormality can be constructed following that for the sensitivity of tests for homogeneity of variance (e.g., Bartlett) given in Scheffé (1959). Thus the user of the Lawley Test must attend to the multivariate distribution underlying observations.

Recommendations


One goal of stability research is to determine if differential stability is sufficient for utilization of a task in an exotic environment. For many tasks, Graphical Analysis alone is sufficiently precise to meet this goal. In cases of massive declines in reliability (e.g., McCauley et al., 1979), the task can be rejected without resort to more elegant techniques. In other cases (e.g., Seales et al., 1979), the graphical evidence for stability is so marked that evidence from the Early vs Late and Lawley Tests could be discounted as meaningless from a practical standpoint. Even in cases where stability or instability is difficult to assess (e.g., Kennedy & Bittner, 1978a), Graphical Analysis is sufficiently precise to indicate sufficient (practical) stability for task use in a limited number of test periods. Because of the wide utility and simplicity of Graphical Analysis, it is suggested as the first step in stability analysis. Reliance on a nongraphical method can be confined to situations where graphical analysis is inconclusive. In cases where confirmation is required, Early vs Late Days appears the current method of choice. It measures "general stability", which is more practically meaningful than the "local stability" assessed by the Lawley. Hence, Graphical Analysis is the recommended method of first choice, with Early vs Late Days Analysis recommended only where a special need for confirmation manifests itself.

Figure 1. Comparison of reliabilities for selected Base Days (1, 2, 4, 6, 10 & 12).

Table 1

Correlations Among Days
for Atari ACM Task

Table 2

Correlations Between

Table 3

ANOVA For Data in Table 2

PHYSIOLOGICAL AND PERFORMANCE MEASUREMENTS: A TIME-SERIES MODEL
Robert  C.  Carter 

Naval  Aerospace  Medical  Research  Laboratory  Detachment,  New  Orleans,  LA  70189 


Some of the most interesting phenomena of psychology, physiology, and medicine develop over time. Investigators of these dynamic phenomena suggest that the best way to study them is to measure an individual repeatedly, and to gain generalizability by studying several individuals. For example, Hecht, Haig, and Chase(3) studied individual dark adaptation curves because composite curves obscure the premier feature of adaptation: the rod-cone break. Similarly, Estes(4) showed that learning curves based on group data misrepresent learning by individuals. More recently, Klein and Armitage(7) demonstrated 90-minute oscillations of mental abilities which would be obscured by averaged performance curves and classified as error variance by traditional statistical analyses.

Data such as these are in the form of a series of observations separated by equal intervals of time (a time series), in which each observation depends on those which precede it. Traditional methods of data analysis are inadequate for these kinds of data because "ordinary parametric or nonparametric statistical procedures which rely on independence or special symmetry in the distribution function are not available nor are the blessings endowed by randomization"(2).

In response to this dilemma, Box, Jenkins and their colleagues have recently developed a system for time series analysis(1). Their model is similar to the psychological model S→O→R, in which a series of Stimuli (S) cause an Organism (O) to produce a series of Responses (R). In the Box-Jenkins model, the Stimuli at times t are called "Input_t", the Organism is called a "Transfer Function", and the responses are called "Output_t". The organism's response time and memory are represented by delays in the transfer function (d_i is a delay of i epochs; see Figure 1).

Figure 1. Block Diagram of the Box-Jenkins Model


The  influence  of  past  inputs  and  outputs  on 
present  output  is  proportional  to  the  model's 
parameters  (tf  and  ^'s). 

The Box-Jenkins model is a dynamic, stochastic, discrete representation which offers the following to students of performance and physiological variables (PPV): 1) Insight into how PPV change over time, including estimates of their differential equations(1); 2) Identification of rhythms and periodicities of PPV and phase relations among PPV(1); 3) Reduction of error variance by explaining some of that variance as covariance among observations(1); 4) Explanation of how PPV (e.g. performance test scores) change in response to other PPV (e.g. vibration exposure history)(1); 5) Dynamic forecasts of PPV, including point estimates and confidence intervals which change appropriately for each future time(8); and 6) Assessment of whether some intervention(2) (e.g. clinical or environmental) affects the level of a PPV. Both univariate(1) and multivariate(9) models are available for each of these objectives. The general applicability of the Box-Jenkins model to behavioral phenomena is illustrated by the fact that a simple Box-Jenkins-type model(3) explains the simplex matrix of intertrial correlations, which characterizes all known repeated-measures data.(6)

Some of the uses of the Box-Jenkins time series model can be exemplified with data on tests of arithmetic ability collected at 6 A.M., 2 P.M., and 10 P.M. on each of 21 successive days. Models of the obtained performance were built using procedures described by Box and Jenkins(1). The primary basis of such models is the correlations between observations separated by a fixed number of measurements: autocorrelations. Figure 2 shows the autocorrelations of addition tests.

Figure 2. Autocorrelations among repeated measurements of addition. (Horizontal axis: number of 8-hour periods separating measurements.)

It indicates that there is a relationship between scores obtained at 24-hour intervals. The nature of this 24-hour cycle is that performance at 2 P.M. was usually poorer than performance at 6 A.M. or 10 P.M. The same 24-hour cycle was discovered in subtraction, multiplication, and division test performance. For instance, models of addition and subtraction performance are, respectively, $z_t = .326\,z_{t-3} + 29.569$ and $z_t = .477\,z_{t-3} + 42.33$, where $z_t$ is the number of arithmetic problems worked correctly during the t-th four-minute trial of the experiment. All coefficients in these models are statistically significant (p < .05), and the subtraction model, for example, reduces the error variance of that series by 21%. A chi-squared test for residual autocorrelation in the modeled series indicates that the addition and subtraction models are complete, $\chi^2(24) = 9.35$, p > .5, and $\chi^2(24) = 25.04$, p > .3, respectively. Such models may be used for description of a process, for intervention analysis, or for forecasting.
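How a fitted model of this kind yields forecasts can be sketched by iterating the equation forward. The coefficient .326 and constant 29.569 are those quoted for addition; the lag of three observations (one 24-hour cycle at three tests per day), the residual variance of 1.0, the starting scores, and the function name are all assumptions made for illustration.

```python
def forecast(history, phi, const, lag, horizon, noise_var=1.0):
    """Iterate z_t = const + phi * z_(t-lag) forward from the observed series.

    Returns a list of (point_forecast, forecast_error_variance) for
    1..horizon steps ahead. Requires len(history) >= lag.
    """
    values = list(history)
    n = len(values)
    results = []
    for h in range(1, horizon + 1):
        # The value needed lies `lag` steps back; it is either an observed
        # score or an earlier forecast already appended to `values`.
        point = const + phi * values[n + h - 1 - lag]
        values.append(point)
        # Each completed cycle of `lag` steps adds one more random shock,
        # discounted by phi squared, to the forecast error.
        shocks = (h - 1) // lag + 1
        variance = noise_var * sum(phi ** (2 * j) for j in range(shocks))
        results.append((point, variance))
    return results


# Hypothetical last three addition scores (6 A.M., 2 P.M., 10 P.M.).
out = forecast([40.0, 45.0, 50.0], phi=0.326, const=29.569, lag=3, horizon=60)
series_mean = 29.569 / (1 - 0.326)
```

Under these assumptions the point forecasts settle toward const / (1 - phi) as the horizon grows, and the error variance rises from the residual variance toward noise_var / (1 - phi squared).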

Dynamic forecasts of addition performance (scaled to have a mean of 50) are shown in Figure 3. A separate forecast is generated for each time in the future. Forecasts of the distant future approach the series mean, and their variances approach the variance of the series. Forecasts of the near future differ appreciably from the series mean, and have reduced variance due to the covariance between the near future and the (now certain) past. Note that traditional 95% confidence intervals (± 2 S.D.) will often be too liberal or too restrictive, compared with confidence intervals based on a Box-Jenkins model. Dynamic forecasting has obvious applications to manpower planning and selection. An application to aerospace medicine would be the comparison of observed PPV scores with predicted scores to indicate the incipient disability of critical personnel (e.g. aircraft pilots).

Figure 3. Dynamic Forecasts of Addition Scores, with 95% Confidence Intervals.

To summarize, Box-Jenkins time series models deserve our consideration as an aid to understanding, prediction, and control of psychological and physiological processes which unfold over time. These dynamic models represent a departure from traditional static models, and their adoption would require a shift to experimental designs that include measurement of a few individuals on numerous occasions.

A SIGNAL DETECTION THEORY FUNCTION AND PARADIGM FOR RELATING SENSITIVITY (d') TO STANDARD AND COMPARISON MAGNITUDES


Alvah C. Bittner, Jr.

Naval Biodynamics Laboratory, New Orleans, LA 70189
Summary 

A general signal-detection theory (SDT) psychometric function is derived which relates both comparison ($\phi_i$) and standard ($\phi_j$) stimulus magnitudes to sensitivity (d'). Applicable to a breadth of stimulus dimensions, this function is

$d'_{ij} = P_1(\ln(\phi_i + P_2) - \ln(\phi_j + P_2))$

where $P_1$ and $P_2$ are constants. To illustrate a paradigm for identifying the $P_i$ (i = 1, 2), three subjects performed a lifted-weight task. Subjects made 64 judgements at each of six standards (0.1 to 1.3 kg), with eight comparison weights per standard (91% to 109% of standard). The results of analyses of individual subject's data by nonlinear least squares revealed that
the  general  model  provided  significantly  better  fit 
over  other  models  (g<10  )  and  accounted  for  94%  of 

each subject's total variation. The centroid of this model was determined to be

$d'_{ij} = 20.32(\ln(\phi_i + 0.0785) - \ln(\phi_j + 0.0785))$

where model parameters were the average of respective subject parameters. Comparisons of this centroid model and historical results are made. It is concluded that the utility of functions relating sensitivity to both standard and comparison magnitudes is greater than that of the traditional partial expressions, and that the multiple-standards-comparisons paradigm provides for a powerful comparison of psychometric functions.
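The centroid function can be evaluated directly. The sketch below assumes magnitudes in kilograms and uses the centroid constants 20.32 and 0.0785 quoted above as defaults; the function names and the inverse helper are illustrative additions obtained by solving the same equation for the comparison magnitude.

```python
import math


def d_prime(comparison, standard, p1=20.32, p2=0.0785):
    """Sensitivity for a comparison weight judged against a standard (kg)."""
    return p1 * (math.log(comparison + p2) - math.log(standard + p2))


def comparison_for_d_prime(standard, target, p1=20.32, p2=0.0785):
    """Comparison weight (kg) at which the function predicts d' = target."""
    return (standard + p2) * math.exp(target / p1) - p2


# A comparison 9% heavier than a 1.0 kg standard (the top of the tested range).
heaviest = d_prime(1.09, 1.0)
# Weight predicted to give unit sensitivity against the same standard.
unit_step = comparison_for_d_prime(1.0, 1.0)
```

Because the function is a difference of logarithms, equal magnitudes give zero sensitivity and swapping comparison and standard reverses the sign. The increment needed for a fixed d' grows in proportion to the standard plus the constant P2, which is how the function ties sensitivity to the Weber fraction.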

Introduction 

Background 

Signal detection theory (SDT) has introduced decision analysis into psychology as a model for human psychophysical behavior [7, 14]. Largely based on the mathematical work of Wald [32], SDT assumes that an "ideal-observer" can calculate the probabilities of an observed stimulus having been produced by a signal (plus noise) or by noise alone. These probabilities, SDT further assumes, are combined by the ideal-observer with a priori signal probabilities and decision costs (or payoffs) into a likelihood ratio "classification function". This classification function is used to optimally decide whether or not an observed stimulus contained a signal [15]. Usually making the (local) assumptions that signal and noise distributions are Gaussian with dissimilar means and common variance, SDT identifies a sensitivity metric (d') which is mathematically invariant for differing cost and a priori probability conditions [14, 25]. The utility of SDT lies in the approximation of the ideal-observer model to human behavior. This approximation is relatively close, with human sensitivity (d') having been found to be relatively constant in studies where prior odds, payoffs, and procedures were varied. These studies have been conducted over several sensory modalities (e.g., visual and auditory) and perceptual tasks (e.g., detection and discrimination) [7, 25].

Since the introduction of SDT, psychophysical researchers have largely focused on testing either the degree that the human observer acts as an ideal-observer or the effects of experimental conditions on sensitivity (d') and bias.

Researchers  using  SDT  methodology  have  not  con¬ 
cerned  themselves  with  many  of  the  problems  of  classi¬ 
cal  psychophysics1^.  With  minor  exceptions  (e.g., 

Wuest [37]), they have not studied the nature of "psychometric functions" which relate judgement probabilities for stimuli when they are compared to a "standard-stimulus". In addition, SDT-based researchers have not been concerned with the related problems of changes in relative observer sensitivity with changes in standards (i.e., changes in the "Weber Fraction" as a function of standard). This failure to address classical problems is, in part, the result of the conceptual paradigm usually applied by SDT researchers. In this paradigm, sensitivity (d') is assumed linear to the common measure of stimulus intensity [14, 15]. For reasons suggested by Thurstone [30, 31], this assumption is approximately met because any measure of intensity is (locally) linear to a scale where the assumption would be valid.* The usual SDT procedure is also not conducive to classical function studies because of the numbers of observations typically taken to estimate a single sensitivity. In the body of this report, a model and procedure will be described which address the classical psychophysical problems described above. Specifically, the report will be directed at the problem of determining a SDT psychometric function relating comparison ($\phi_i$) and standard ($\phi_j$) stimulus magnitude to sensitivity ($d'_{ij}$) in a detection or discrimination experiment.

* This is apparent from the Taylor's Series, where $f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x$.


The purposes of this report are to: (1) derive a general psychometric function which relates comparison ($\phi_i$) and standard ($\phi_j$) magnitude to SDT sensitivity ($d'_{ij}$); (2) illustrate a paradigm for determining the specific form of the $d'_{ij}$ function with data from a lifted-weight task; and (3) demonstrate the utility of the $d'_{ij}$ function by comparison of a centroid lifted-weight model with classical results.


A General Psychometric Function

In this section, a function relating SDT sensitivity ($d'_{ij}$) to comparison ($\phi_i$) and standard ($\phi_j$) magnitudes will be derived. This function will be shown to be

$d'_{ij} = P_1(\ln(\phi_i + P_2) - \ln(\phi_j + P_2))$   (1)

where the $P_i$ (i = 1, 2) are constants specific to a stimulus dimension. The derivation of (1) will be based on the Brentano-Ekman Law, which will be described before proceeding with the derivation.


Brentano-Ekman Law

The Brentano-Ekman law is a combination of a conjecture of Brentano, 1874, and contemporary direct-scaled power laws [23, 28]. Brentano's conjecture was that an increment of sensory variability in subjective units ($\Delta\psi$) is directly proportional to the stimulus in the same units ($\psi$), i.e.,

$\Delta\psi / \psi = K$   (2)

where K is a constant. Experimentally, $\Delta\psi$ in (2) is the amount of change in $\psi$ which alters detection or discrimination probability Z-scores through a fixed range (e.g., $\Delta\psi$ is frequently determined for a unity change in Z, $\Delta Z = 1$). Hence, (2) can be rewritten in the form

$\Delta\psi / \psi = k\,\Delta Z$   (3)

where k is a constant ($k = K/\Delta Z$). Ekman and his collaborators [3, 4, 9-15] are credited with establishing the generality of Brentano's conjecture when the subjective units ($\psi$) are linked to physical magnitudes by a direct-scaled power law [14, 23, 28]. The general form of this law, which will be used in derivation of (1), can be written

$\psi = C(\phi + P_2)^{\beta}$   (4)

where C, $P_2$ and $\beta$ are constants which vary for perceptual dimensions [1, 12, 23]. Recent studies by Teghtsoonian have indicated that, for a simplified version of (4), the Brentano-Ekman law approximately holds across more than two dozen perceptual dimensions [26, 27]. Hence, the function (1), which will be derived, can be expected to have substantial generality.


Derivation 

The function (1) can be derived by "integration" of (3), substitution
of (4) into the result, and insertion of that result into the
definition of SDT ($d'$) sensitivity. In particular, letting the
increments $\Delta\psi$ and $\Delta Z$ become differentials in (3) and
integrating,

$$kZ = \ln(\psi) + C^{*} \tag{5}$$

where $C^{*}$ is a constant of integration. Substituting (4) into (5),
it follows that

$$kZ = \ln\left(C(\phi + P_2)^{B}\right) + C^{*} \tag{6}$$

or

$$Z = P_1 \ln(\phi + P_2) + C^{**} \tag{7}$$

where $P_1 = B/k$ and $C^{**} = (C^{*} + \ln C)/k$. To derive (1), it
is necessary only to determine $Z_i$ and $Z_j$ for stimulus magnitudes
$\phi_i$ and $\phi_j$ from (7) and substitute into the definition of
sensitivity ($d' = Z_i - Z_j$); the additive constant $C^{**}$ cancels
in the difference. It is pertinent to note that in (1), the notation
"$d'_{ij}$" is used to indicate functional dependency on $\phi_i$ and
$\phi_j$. Another more general derivation of (1) has been given
elsewhere by Bittner.
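
As an illustration of the derived function, the following is a minimal
sketch of equation (1) in Python. The function name `d_prime` and the
constants `P1 = 20.0` and `P2 = 0.08` are hypothetical choices made for
the example; they are not values taken from the report.

```python
import math


def d_prime(phi_i, phi_j, p1, p2):
    """Sensitivity from equation (1) for comparison phi_i and standard phi_j."""
    return p1 * (math.log(phi_i + p2) - math.log(phi_j + p2))


# Hypothetical constants for a lifted-weight dimension (magnitudes in kg).
P1, P2 = 20.0, 0.08

heavier = d_prime(0.109, 0.100, P1, P2)  # comparison 9% above the standard
lighter = d_prime(0.091, 0.100, P1, P2)  # comparison 9% below the standard
same = d_prime(0.100, 0.100, P1, P2)     # comparison equal to the standard

print(f"heavier: {heavier:.3f}  lighter: {lighter:.3f}  same: {same:.3f}")
```

Because (1) is a difference of two evaluations of (7), the sketch gives
zero when comparison and standard coincide, and swapping the two
arguments reverses the sign.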


A Multiple Standards-Comparison Paradigm

In this section, a Multiple Standards-Comparison Paradigm (MSCP) will
be illustrated for identifying the constants in (1). First, the Method
of the MSCP will be given for a lifted-weight task. The essential
feature of this method lies in its procedure, which secures data
across several standards and comparison stimuli. Second, the analysis
of the data obtained by the MSCP procedure will be given for the data
from the lifted-weight task.

Method 

The apparatus, subjects and procedure of the illustrative weight
judgement experiment are described below. A more comprehensive
description of these has been given elsewhere [1].

Apparatus. Six series of weights were used. Each series consisted of a
standard and eight comparison weights. Divided into two blocks of
three, the weights of the standards were 0.1 kg, 0.4 kg, and 1.0 kg
for the first block, and 0.3 kg, 0.7 kg, and 1.3 kg for the second
block. Comparison weights were from 91% to 109% of the standard's
weight within a series. All weights were made from new half-pint paint
cans (79 mm in diameter and 79 mm deep) fitted with lids and weighted
with lead shot and cotton wads. To facilitate presentation, each
weight of a series was appropriately labeled and placed on a wooden
turntable. Weights and turntable were hidden from the subjects' view
by a felt curtain through which they could reach. An adjustable chair
was employed so that subjects could be seated with the elbow resting
on a felt pad, with the humerus at an angle of about 45 degrees with
respect to the body's trunk.

Subjects. The subjects (observers) were three (E-3) enlisted men on
the staff of the Naval Biodynamics Laboratory as research volunteers.
For six months prior to this study, the subjects had served in
psychological experiments, but their only exposure to psychophysical
judgement tasks was 300 trials of training on the weight task at 0.1,
0.6 and 1.0 kg standards two weeks prior to this study. To qualify as
volunteers, the subjects had to be above the national average for Navy
enlisted personnel in physical health, mental health, and
intelligence. The subjects received extra compensation for
participating in the research program. Each volunteer was recruited,
evaluated, and employed in accordance with procedures specified in
Secretary of the Navy Instruction 3900.3 and Bureau of Medicine and
Surgery Instruction 3900.6. These instructions are based on voluntary
consent and meet or exceed the most stringent provisions of prevailing
national and international guidelines [29].

Procedure. Subjects were tested in two blocks of three days, with two
weeks between blocks. During the first block, subjects were tested one
day each with standards of 0.1, 0.4, and 1.0 kg, with order and
standard counterbalanced by Latin Square. During the second block,
standards of 0.3, 0.7 and 1.3 kg were tested one day each in a similar
counterbalanced manner. Eight comparison weights were judged 64 times
against each standard by each subject.

After being brought to the laboratory for a series, a subject was
seated so that his elbow rested on a felt pad, with the forearm
directly forward of the shoulder. The subject was initially told that
weights would be placed on the table in his grasp, and that he should
lift the weight only to about one inch (2.54 cm) above the table,
bending only the elbow while letting his elbow rest lightly on the
pad. This lifting procedure eliminated variations in data due to
lifting with wrist or shoulder [21]. Subsequent to lifting
instructions, the subject was also informed that on one-half of the
trials the comparison weights would be lighter and on one-half the
comparison weights would be heavier. He was told that his job would be
to judge if the second comparison weight was heavier. The replies "no"
for not heavier and "yes" for heavier were used as judgement
indicators. The judgement of the standard against the comparison
weights commenced after instruction.


Results 

The  method  of  fitting  models  will  be  described 
below  for  the  weight  task.  A  comparison  of  models 
will  be  subsequently  made  and  a  centroid  model  will 
be  given. 

Model Fitting. After data collection, each subject's responses for
each comparison weight, within a given subject-series, were first
collected and the empirical probabilities of "heavier" responses
determined. These probabilities were, in turn, converted to
preliminary ($\hat{d}'_{ij}$) estimates by

$$\hat{d}'_{ij} = Z(P_{ij}) - Z(P_{jj}) \tag{8}$$

where $Z(P_{ij})$ is the Gaussian standard score transformation of the
probability of "heavier" judgements when $\phi_i$ is the comparison
stimulus magnitude and $\phi_j$ the standard. Each of three
($d'_{ij}$) functions given in Table 1 was then separately fit to the
totality of each subject's data so as to minimize

$$\sum_{j}\sum_{i}\left[\hat{d}'_{ij} - \left(d'_{ij} + \sum_{k=1}^{6} P_{sk}\,\delta_{kj}\right)\right]^{2} \tag{9}$$

where the $P_{sk}$ ($k = 1, 6$) are six standard parameters to be fit
and $\delta_{kj}$ is a Kronecker delta*.
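
The conversion in (8) can be sketched with the Python standard library.
This is only an illustration: the helper names `z_score` and `d_hat`
and the response counts (54 and 32 "heavier" replies out of 64
judgements) are assumptions made for the example, not data from the
study.

```python
from statistics import NormalDist


def z_score(p):
    """Gaussian standard-score transform of a probability strictly between 0 and 1."""
    return NormalDist().inv_cdf(p)


def d_hat(p_ij, p_jj):
    """Preliminary sensitivity estimate: Z(P_ij) - Z(P_jj), as in equation (8)."""
    return z_score(p_ij) - z_score(p_jj)


# Hypothetical proportions of "heavier" replies out of 64 judgements each.
p_comparison = 54 / 64  # comparison weight judged against the standard
p_standard = 32 / 64    # standard judged against itself

estimate = d_hat(p_comparison, p_standard)
print(f"preliminary d' estimate: {estimate:.2f}")
```

Proportions of exactly 0 or 1 have no finite standard score, so
`inv_cdf` raises an error for them; how such cells were handled is not
stated in the text.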

Table 1
Functions

I:    $d'_{ij} = P_1(\phi_i - \phi_j)$

IIA:  $d'_{ij} = P_1(\ln(\phi_i) - \ln(\phi_j))$

IIB:  $d'_{ij} = P_1(\ln(\phi_i + P_2) - \ln(\phi_j + P_2))$

Models of the form (9), with their standard parameters, it is
noteworthy, provide for utilizing all $Z(P_{ij})$ data in a series for
estimating $Z(P_{jj})$ rather than just the empirical $Z(P_{jj})$.
Statistical and empirical justifications for this procedure have been
made by Bittner and colleagues [1,5].
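
A least-squares fit of Model IIB can be sketched as follows. This is
not the report's procedure: it substitutes a simple grid search over
$P_2$ (with the closed-form least-squares value of $P_1$ at each grid
point) for a Gauss-Newton routine, and it omits the six standard
parameters of (9). The function name `fit_iib`, the comparison ratios,
and the generating constants are all hypothetical.

```python
import math


def fit_iib(data, p2_grid):
    """Return (p1, p2, sse) minimizing squared error of Model IIB over a grid of p2.

    data holds (comparison, standard, d_hat) triples.
    """
    best = None
    for p2 in p2_grid:
        x = [math.log(c + p2) - math.log(s + p2) for c, s, _ in data]
        sxx = sum(v * v for v in x)
        sxy = sum(v * d for v, (_, _, d) in zip(x, data))
        p1 = sxy / sxx  # least-squares slope through the origin for this p2
        sse = sum((d - p1 * v) ** 2 for v, (_, _, d) in zip(x, data))
        if best is None or sse < best[2]:
            best = (p1, p2, sse)
    return best


# Synthetic, noise-free estimates generated from hypothetical constants.
TRUE_P1, TRUE_P2 = 20.0, 0.08
standards = (0.1, 0.4, 1.0)
ratios = (0.91, 0.94, 0.97, 0.99, 1.01, 1.03, 1.06, 1.09)  # assumed spacing
data = [
    (s * r, s, TRUE_P1 * (math.log(s * r + TRUE_P2) - math.log(s + TRUE_P2)))
    for s in standards
    for r in ratios
]

grid = [k / 100 for k in range(0, 21)]  # candidate p2 values 0.00 to 0.20
p1, p2, sse = fit_iib(data, grid)
print(f"p1 = {p1:.2f}, p2 = {p2:.2f}, residual sum-of-squares = {sse:.2e}")
```

Using several standards is what makes $P_2$ identifiable here: with a
single standard, a change in $P_2$ can be largely absorbed by $P_1$.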

All minimizations of (9) were accomplished using the nonlinear least
squares computer program BMDP3R. This program employed a stepwise
Gauss-Newton (total differential) method which selects the parameter
to be estimated at each step for greatest potential reduction in the
residual sum-of-squares. Originally developed by Hartley, this
technique has been shown to generally converge more rapidly than the
unstepwise total differential method in difficult cases. All
minimizations employing BMDP3R were conducted using at least three
initial estimates of parameters. These initial estimates were derived
by various means, such as graphical estimates, parameters from simpler
models, multivariate search, and other more "subjective" techniques.
Several sets of initial estimates were used to give assurance that the
least-squares minima were "global" rather than "local".

Comparison of Models. Figure 1 shows the percent remaining
sums-of-squares for the subject "observers" over Models I, IIA and
IIB. Examining this figure, it appears that the remaining
sums-of-squares are less than half as great for Model IIA as for
Model I. Table 2, which presents statistical comparisons of I and IIA,
supports this view, with each subject (observer) showing statistical
significance (p < E-7)**. The Pearson $P_\lambda$ statistic combines
the individual significance levels and indicates overall significance
beyond p < 1.1E-10. Examining Figure 1, it is also apparent that the
residual sums-of-squares for Model IIB are substantially less than for
IIA. This view is supported by the results reported in Table 3, where
the least significant result is for Observer 1 (p < .005). The
$P_\lambda$ statistic indicates that, over the subjects, the
significance of this difference between IIA and IIB is beyond
p < 1.5E-7. Overall, Model IIB provided a significantly better fit
than the other models and accounted for 94% of each subject's total
variation.


Figure 1. Percent Remaining Sum-of-Squares for Three Models
(Models I, IIA and IIB).

Table 2
Summary of Conservative Comparisons of Models I and IIA
with Standard Parameters**

Table 3
Summary of Comparisons of Models IIA and IIB
with Standard Parameters**

*  $\delta_{kj} = 1$ if $k = j$; $\delta_{kj} = 0$ if $k \neq j$.
** E-n = $10^{-n}$


Centroid Model

Table 4 gives the values of the parameters $P_1$ and $P_2$ for each of
the three observers determined by fitting IIB.

Table 4
Parameter Values for Model IIB

            PARAMETERS
OBSERVER    P1       P2
1           17.17    0.0690
2           25.56    0.0822
3           18.23    0.0844

Based on the averages of the parameters, the centroid observer model
is

$$d'_{ij} = 20.32\left(\ln(\phi_i + 0.0785) - \ln(\phi_j + 0.0785)\right) \tag{10}$$

This model is a more appropriate representation of behavior than
averaged subject performance at different levels because it preserves
the form of individual functions.
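
The averaging behind (10) can be checked with a short sketch. It uses
the Table 4 entries, reading Observer 2's $P_1$ as 25.56, which is the
value implied by the reported average of 20.32 (an assumption where the
printed digit is unclear). The names `observers` and
`centroid_d_prime` are hypothetical.

```python
import math

# Table 4 entries as (P1, P2); Observer 2's P1 is assumed to be 25.56.
observers = {1: (17.17, 0.0690), 2: (25.56, 0.0822), 3: (18.23, 0.0844)}

# Centroid parameters: simple averages over observers.
p1_centroid = sum(p1 for p1, _ in observers.values()) / len(observers)
p2_centroid = sum(p2 for _, p2 in observers.values()) / len(observers)


def centroid_d_prime(phi_i, phi_j):
    """Centroid-observer sensitivity in the form of equation (10)."""
    return p1_centroid * (
        math.log(phi_i + p2_centroid) - math.log(phi_j + p2_centroid)
    )


print(f"P1 = {p1_centroid:.2f}, P2 = {p2_centroid:.4f}")
```

Averaging parameters rather than averaging the observers' scores keeps
the result in the same functional family as each individual fit.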


Discussion 

The theoretical and practical utility of the $d'_{ij}$ function and
MSCP will be discussed in this section. Subsequent to a brief review
of generality, MSCP sensitivity and the comparability of $d'_{ij}$
with historical results will be delineated. Conclusions will be made
based on the review and delineations.


Generality  of  the  Derived  Function 

The general SDT psychometric function (1) derived earlier can be
expected to characterize a breadth of stimulus dimensions because of
its basis in the Brentano-Ekman law. The results of Teghtsoonian, in
particular, suggest the applicability of (1) to more than two dozen
sensory dimensions [26,27]. Using the MSCP, as illustrated for the
weight task, the parameters of this function can also be identified.
Hence, application of the results of this report can be expected to
yield $d'_{ij}$ functions which will successfully characterize a wide
range of stimulus dimensions.


MSCP Sensitivity

Psychometric law comparisons have frequently contrasted Fechnerian
phi-gamma and Thurstonian [30,31] phi-log-gamma hypotheses. These
comparisons have classically been made with data obtained from a
single standard and a set of comparison stimuli [18,36]. With a
restricted range of stimuli, "... it (has) consequently been difficult
to distinguish between ... hypotheses empirically ..." [14]. The MSCP,
with a greater range of stimuli, offers greater sensitivity than the
classical paradigm. This can be seen by noting the strength of the
comparisons of Models I, IIA, and IIB as seen in Figure 1 and Tables 2
and 3. Viewable as analogous to the phi-log-gamma hypothesis, Model
IIA was seen to have less than half the residual sums-of-squares of
Model I, which is similarly analogous to the phi-gamma hypothesis. For
each of the three observers, this difference was highly significant
(p < E-7) and, across observers, this difference was very highly
significant (p < 1.1E-10). In addition, Model IIB, which is analogous
to a generalized phi-log-gamma hypothesis, was found to have 20% to
30% less residual than Model IIA, and across observers this result was
also very highly significant (p < 1.5E-7). The MSCP offers
considerable sensitivity for comparison of psychometric functions.

Centroid Model and Historical Results

The centroid model (10) contains similar information to that contained
in a large body of classical results: (a) the body of Weber-Fraction
results; and (b) Brown's single-observer results.

Weber-Fraction Results. Figure 2 shows Weber-Fraction
($\Delta\phi/\phi$) results obtained by Fechner, Brown, Woodrow [33],
Oberlin [21], and an exercise of Model (10).

Figure 2. Comparison of Empirically Established Poikilitic Model from
Current Study with Classical Results.

Examining Figure 2, it can be seen that (10) follows the body of
classical results. In particular, the model is seen to be virtually on
top of the results of Oberlin from 0.025 kg to about 0.1 kg. From
0.1 kg to 0.2 kg, the model overlays Woodrow's findings. From 0.2 to
0.6 kg, the model results are seen to parallel Oberlin's and
Fechner's, with the paralleling of Fechner extending to 3.0 kg. Hence,
in addition to the results of this investigation, the centroid model
(10) represents a body of Weber-Fraction results from previous
investigations.
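
One way to "exercise" model (10) for Weber fractions is sketched below.
Setting $d'_{ij}$ in (10) equal to a criterion value $d$ and solving
for the increment gives
$\Delta\phi/\phi = (\phi + P_2)(e^{d/P_1} - 1)/\phi$. The criterion
$d = 1$ used here is an assumption; the text does not state which
criterion underlies Figure 2. The function name `weber_fraction` is
hypothetical.

```python
import math

P1, P2 = 20.32, 0.0785  # centroid parameters from equation (10)


def weber_fraction(phi, criterion=1.0):
    """Delta-phi / phi at which model (10) predicts d' equal to `criterion`."""
    return (phi + P2) * (math.exp(criterion / P1) - 1.0) / phi


# Standards spanning the range discussed for Figure 2 (kg).
for standard_kg in (0.025, 0.1, 0.2, 0.6, 3.0):
    print(f"{standard_kg:5.3f} kg  ->  {weber_fraction(standard_kg):.3f}")
```

Under this reading the predicted fraction falls as the standard gets
heavier and levels off once the standard is large relative to $P_2$.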

Brown's Single Observer Data. Figure 3 compares data collected by
Brown with centroid model (10) estimates adjusted for sensitivity.

Figure 3. Comparison of Brown and Adjusted Centroid Model Results.

The Brown results were obtained in a two-category weight-lifting
experiment with a standard at 0.1 kg and comparison stimuli ranging
from 0.092 to 0.118 kg in 0.001 kg increments. Employing the method of
constant stimuli with 700 trials at each comparison value, Brown's
results are the most comprehensive study of any single psychometric
function in the literature. The model (10) was adjusted by multiplying
all sensitivity estimates by 1.15, as suggested by differences between
the centroid model and Brown's Weber-Fraction results seen in
Figure 2. On examination of Figure 3, the near overlay of a reasonably
adjusted centroid model and the Brown data is seen.

Conclusions

It can be concluded that: (a) the $d'_{ij}$ function has wide
potential for description of sensitivity across sensory dimensions;
(b) the utility of the $d'_{ij}$ function relating sensitivity to both
standard and comparison is greater than that of traditional partial
expressions; and (c) the MSCP provides for more sensitivity in
comparing hypothetical psychometric functions than traditional
paradigms.
... the three measures of stability (means, standard deviations,
intertrial correlations) can ...

... constant slope correction and a .650 task definition ...


INTRODUCTION 

Performance stability is an important concept in both the experimental
and industrial environments. Presently, this construct is helping to
develop a performance battery (PETER, Performance Evaluation Tests for
Environmental Research), which will eventually be used to study
behavior under unusual and adverse conditions. The usefulness of a
test for this purpose is determined by unchanging, stable scores in
the baseline or controlled condition. This criterion is important
because any effect associated with repeated measurement would be
confounded with changes of performance due to the environment.
Stability (Jones, 1979) is defined as the period when (1) mean
performance reaches nearly constant slope over time, (2)
between-subject variances are homogeneous over time, and (3) relative
performance standings of the subjects, reflected in cross-session
reliabilities, are constant over time.

The implications of this research can be generalized to the industrial
workplace. For example, the statistical properties of stability have
application when learning curves are utilized as tools of management
for purposes of scheduling, productivity, training and forecasting
(Moore, Jablonski, 1969). With practice, people improve their ability
to do work, which can be evidenced by increases in such diverse skills
as scanning rate and discrimination, memory and rule-using,
time-sharing and planning, movement efficiency and precision. As
workers gain experience, either through formal training or on-the-job
exposure, their productivity increases rapidly at first; but then, as
performance on a particular job or task is optimized, the learning
curve flattens or levels off. This flattening period is synonymous
with stability. The valid determination of this property as it relates
to production levels and the daily reliability of labor is critical to
forecasting and scheduling within an organization.

Another tool of industry to which stability has application is the
control chart. Manufacturing processes, even when controlled, have a
certain amount of variability which cannot be eliminated. When this
variability is confined to random or chance variation, the process is
considered to be within statistical control (Miller, Freund, 1965).
This period of control is synonymous with stability. Control charts
consisting of a central line and upper and lower limits for the mean,
standard deviation and the range can be utilized for the purpose of
detecting serious deviations from stability. These limits are
determined by setting statistical confidence bands ($\pm 3\sigma$)
around the estimated population mean and standard deviation. Those
estimated values are usually derived by averaging the statistics of
the samples collected during the period of process control. By
plotting the results obtained from the samples, the determination of
stability can be judged by the number of values inside or outside of
the confidence limits. Although control charts usually have
application to equipment variables, they are quite suited to the
analysis of worker variables.

In the two industrial applications of stability just mentioned,
learning curves and control charts, the emphasis is upon means and
standard deviations. Intertrial correlations, however, are just as
important because they give the investigator a measure of internal
reliability. For example, a theoretical group of workers, who are
performing a particular task, may decide to cooperate in the lowering
of their production levels during baseline data collection. If the
means and standard deviations were constant, the investigator would
have difficulty in determining whether the data gave a valid
indication of performance achievement and stability. However,
rank-order positions on a daily basis (intertrial correlations) are
more difficult to manipulate, especially when the people being
observed are not aware of this subtle statistic. Because of the
importance of reliability to performance, the purpose of this paper
will be to discuss various methodologies which can be used to
determine correlational stability. A cognitive experimental test,
which was conducted at this laboratory, will function as the vehicle
of explanation.

METHOD 

The grammatical reasoning test (Baddeley, 1968) was scrutinized in
order to determine whether it was suitable for inclusion in the PETER
battery (Carter, Kennedy and Bittner, 1980). This test is purported to
measure "higher mental processes." Twenty-three subjects took the test
on 15 consecutive workdays in a standard environment. The grammatical
reasoning test involves five grammatical transformations on statements
about the relation between two letters: A and B. The five
transformations are: (1) active versus passive sentence construction,
(2) true versus false statement, (3) affirmative versus negative
phrasing, (4) use of the verb "precedes" versus the verb "follows,"
and (5) sequential order of A versus B. There are 32 possible items,
and they were arranged in a different random order on each day of the
experiment. The subject responded with either a "True" or "False"
depending upon the verity of each statement. For example, "True" is
the appropriate response to the stimulus: A precedes B - AB. Subjects
were allowed 1 minute to work on this paper-and-pencil test on each
day of the experiment. The test was administered to the subjects in a
group. Scores were the number of correct responses.
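
The count of 32 items follows from crossing the five two-way
transformations (2 x 2 x 2 x 2 x 2). A minimal sketch of that crossing
is given below. The exact sentence wordings and the helper name
`build_items` are assumptions made for illustration; only the example
item "A precedes B - AB" comes from the text.

```python
from itertools import product


def build_items():
    """Enumerate the 32 sentence/letter-pair items with their correct answers."""
    items = []
    for voice, negated, verb, subject, pair in product(
        ("active", "passive"), (False, True), ("precedes", "follows"), "AB", ("AB", "BA")
    ):
        other = "B" if subject == "A" else "A"
        if voice == "active":
            stem = {"precedes": "precede", "follows": "follow"}[verb]
            phrase = f"does not {stem}" if negated else verb
        else:
            participle = {"precedes": "preceded", "follows": "followed"}[verb]
            phrase = f"is {'not ' if negated else ''}{participle} by"
        sentence = f"{subject} {phrase} {other}"
        # Which letter does the un-negated sentence place first?
        subject_first = (verb == "precedes") == (voice == "active")
        claimed_first = subject if subject_first else other
        # The statement is true when the claim matches the pair, flipped by negation.
        answer = (pair[0] == claimed_first) != negated
        items.append((f"{sentence} - {pair}", answer))
    return items


items = build_items()
print(len(items), "items;", sum(answer for _, answer in items), "are true")
```

Each of the 16 distinct sentences is paired once with AB and once with
BA, so exactly half of the items call for a "True" response.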

RESULTS  AND  DISCUSSION 

The results indicated that the grammatical reasoning test is quite
suitable for use in repeated measures experiments. The means and
standard deviations appear in Table 1. The means increase linearly
over time (slope = .3 correct responses/day), as confirmed by a
repeated measures analysis of variance. The linear component of the
days effect was statistically significant (F(1,22) = 50.39,
p < .0005), and accounted for 90% of the variance attributable to
days. There was no indication that the variance of grammatical
reasoning scores changed over the 15 days (Fmax(15, 22) = 1.82,
non-significant at .05 level). In order to determine the days causing
the significant deviations, 99 percent confidence limits were placed
around the average mean and standard deviation of the 15 days, as in
the construction of control charts (Miller, Freund, 1965). Each of the
15 days was treated as a sample from the population. Using the t
distribution for the means and the chi-square distribution for the
standard deviations, the resulting central values (CV) with 99 percent
upper (UL) and lower (LL) confidence limits were: (1) mean = 12.62
(CV), 15.66 (UL), 9.58 (LL), and (2) standard deviation = 5.06 (CV),
7.06 (UL), 3.17 (LL). Table 1 shows that none of the standard
deviations and one mean (day 1) are outside these statistical
boundaries. There is a good possibility, however, that if the
experiment had continued, the means on the days after day 15 would
have been outside the limits. If a correction of .30 constant slope on
the control chart for the mean had been utilized as a forecasting
projection, this contingency would not occur and the investigator
would still have had an estimation of stable performance.
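
The control-chart check on the daily means can be approximated with
the rounded values printed in Table 1. The sketch below is not the
procedure used in the text: it assumes conventional three-sigma limits
built from the average daily standard deviation and the sample size of
23, rather than the t-based 99 percent limits, so its limits differ
slightly from the reported 9.58 and 15.66. The function name
`mean_chart_limits` is hypothetical.

```python
import math

# Daily means and standard deviations as printed in Table 1 (days 1 to 15).
means = [8.5, 9.9, 9.8, 11.1, 11.6, 12.4, 13.3, 13.4,
         13.1, 13.4, 14.7, 14.0, 14.3, 14.1, 15.5]
sds = [3.3, 4.3, 4.3, 4.8, 4.6, 4.9, 5.0, 4.8,
       5.4, 4.5, 5.3, 6.0, 4.5, 5.8, 4.7]
N_SUBJECTS = 23


def mean_chart_limits(sample_means, sample_sds, n, width=3.0):
    """Central line and +/- `width` standard-error limits for a chart of means."""
    center = sum(sample_means) / len(sample_means)
    s_bar = sum(sample_sds) / len(sample_sds)  # average within-day SD
    half_width = width * s_bar / math.sqrt(n)
    return center - half_width, center, center + half_width


lower, center, upper = mean_chart_limits(means, sds, N_SUBJECTS)
outside = [day for day, m in enumerate(means, start=1) if not lower <= m <= upper]
print(f"LL = {lower:.2f}, CV = {center:.2f}, UL = {upper:.2f}, outside: {outside}")
```

Under these assumptions the sketch flags the same single day as the
text does, which suggests the conclusion is not sensitive to the exact
choice of limits.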

Another condition which is necessary for stability is that of the
intertrial correlations being constant over time. Table 1 depicts the
task definition for each day, which is the average of the intertrial
correlations of that day with all other days. In other words, task
definition by day is an average of 14 correlations, and task
definition for the matrix is a mean of 210 correlations. The task
definition by matrix was .72. The Lawley test (Morrison, 1967)
indicated that the intertrial correlations did not change appreciably
after Day 4 ($\chi^2(44) = 43.65$, non-significant at .05 level) but
were not constant after Day 3 ($\chi^2(54) = 83.29$, p < .025). Since
day 15 was omitted from these analyses due to its relatively lower
task definition, stability was noted from days 5 to 14. The usefulness
of intertrial correlations can be demonstrated using the three indices
of day 15. Since this day was an end point known to the subjects,
there may have been a lack of concentration, demonstrated by the task
definition (.60). The high mean and stable standard deviation
indicate, without the correlational information, that the day 15
sample was performing very well. However, only when the three indices
are studied together does a more complete picture emerge.
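
Task definition as described above can be sketched directly from a
correlation matrix. The 4-day matrix below is hypothetical (the
study's 15 x 15 matrix is not reproduced in the text), and the function
names are illustrative. The helper `lawley_df` uses (k + 1)(k - 2)/2,
the degrees of freedom for the Lawley test of equal correlations among
k variables, which agrees with the values of 44 and 54 reported for
10 and 11 days.

```python
def task_definitions(r):
    """Average correlation of each day with all other days."""
    k = len(r)
    return [sum(r[i][j] for j in range(k) if j != i) / (k - 1) for i in range(k)]


def matrix_task_definition(r):
    """Mean of all off-diagonal correlations in the matrix."""
    per_day = task_definitions(r)
    return sum(per_day) / len(per_day)


def lawley_df(k):
    """Degrees of freedom for the Lawley test of equal correlations, k variables."""
    return (k + 1) * (k - 2) // 2


# Hypothetical intertrial correlations for four days.
r = [
    [1.0, 0.6, 0.5, 0.4],
    [0.6, 1.0, 0.7, 0.5],
    [0.5, 0.7, 1.0, 0.8],
    [0.4, 0.5, 0.8, 1.0],
]

per_day = task_definitions(r)
overall = matrix_task_definition(r)
print([round(t, 3) for t in per_day], round(overall, 3))
```

With 15 days, each daily value averages 14 correlations and the matrix
value averages 15 x 14 = 210 entries, each pair being counted in both
directions.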

The utility of the Lawley test in the determination of correlational
stability is lowered by the following trait: non-significant results
indicate that correlations among trials are equal, but a significant
analysis does not mean that a differential change is present (Jones,
1979). To draw this conclusion, an alternative method is necessary.
Such an approach may be factor analysis. This methodology operates to
maximize the amount of variance shared commonly among the variables.
When the variables are days and the cases are subjects, stability
should be indicated by the loadings of the variables as well as the
amount of variance explained by the first unrotated factor. This
position is partially supported by Humphreys (1960), who believed that
the correlational matrix containing variables of successive trials on
the same task represented only one common factor. In addition,
Corballis (1965) suggested a linear model as an alternative to the
usual factor model of multiple solutions. The one-factor solution
presented in this paper is comparable to a linear model. Table 1 lists
the factor loading on each day. These data indicate that 75 percent of
the variance was explained by this analysis and that the average
factor loading was .86. If days 5 to 14 were considered as the stable
period, as indicated by the Lawley test, the explained variance would
increase to 85 percent, and the factor loadings would be near or
greater than .90.
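
For a single unrotated factor, the proportion of variance explained is
the mean of the squared loadings, so the figures quoted above can be
checked against the loadings printed in Table 1. The sketch below is
illustrative only; the function name `explained_variance` is
hypothetical, and the days 5 to 14 figure it computes reuses the
full-matrix loadings rather than refitting the factor, so it comes out
near 83 percent rather than the 85 percent quoted for the refit.

```python
# Factor loadings for days 1 to 15 as printed in Table 1.
loadings = [.68, .70, .85, .83, .93, .93, .90, .92,
            .94, .86, .87, .90, .90, .96, .72]


def explained_variance(values):
    """Proportion of variance explained by one factor: mean squared loading."""
    return sum(v * v for v in values) / len(values)


overall = explained_variance(loadings)
mean_loading = sum(loadings) / len(loadings)
stable_period = explained_variance(loadings[4:14])  # days 5 to 14, not refit

print(f"explained: {overall:.1%}, mean loading: {mean_loading:.2f}, "
      f"days 5-14: {stable_period:.1%}")
```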

The Kolmogorov-Smirnov (K-S) goodness of fit test (Miller, Freund,
1965) was examined in order to determine whether it would be
applicable to correlational stability analysis. The one-sample test is
concerned with the amount of agreement between observed and expected
cumulative distributions. For example, the test was utilized in order
to determine whether task definition and factor analysis were
attempting to explain similar constructs: the cumulative distribution
of the explained variance on each day within the total matrix. In
Table 1, the relative cumulative distributions of the squared task
definitions and factor loadings are presented. The difference between
the distributions is non-significant (critical difference at p = .05:
.073), and therefore they can be considered identical. In fact, a
multiple of 1.2 could be used to equate each daily task definition to
its related factor loading. This multiple was determined by dividing
.86 (average of factor loadings) by .72 (task definition by matrix).
Since factor analytic results having a one-factor solution and task
definition appear to be similar constructs, the determination of
correlational stability can rely mainly upon task definition. This
conclusion was further supported by the results from four other mental
tests (free recall, interference susceptibility, running recognition,
and list differentiation). In addition, these tests indicated that a
.650 task definition may be an acceptable standard, since this value
is comparable to 68 percent of the variance from factor analysis and
to an average factor loading of approximately .82.

A one-sample K-S test was conducted using the cumulative frequency
distributions of squared task definitions (observed) and predicted
values based upon 1.0 divided by 15. These predicted scores
represented the theoretical distribution of stable and equal task
definitions. The absolute maximum difference point in Table 1 was
Day 4, which was similar to the Lawley test. These results, however,
were non-significant at the .1 level (critical differences: .058 at
p = .2; .066 at p = .1; .073 at p = .05). In other words, the K-S test
indicated that correlational stability was arrived at on day 1. The
difference between the Lawley and K-S, therefore, must be one mainly
of test stringency. For the K-S test to have been significant and
independently distributed, the level of significance would have been
at the .20 level. In order to determine the stringency of the Lawley,
another test was conducted based upon the distribution of days 5-14.
The task definition by matrix for these ten days was .83. Using the
K-S one-sample test, the maximum difference was .012. In conclusion,
it appears that the Lawley is very conservative and should be used
with caution.
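
The one-sample comparison described above can be reproduced from the
task definitions printed in Table 1. The sketch below is a minimal
illustration, assuming the statistic is the largest absolute gap
between the observed cumulative proportion of squared task definitions
and the predicted proportion (day number divided by 15). The function
name `ks_against_equal_days` is hypothetical; the critical differences
are those quoted in the text.

```python
# Task definitions for days 1 to 15 as printed in Table 1.
task_defs = [.56, .58, .71, .69, .78, .79, .76, .77,
             .79, .72, .73, .76, .75, .81, .60]


def ks_against_equal_days(values):
    """Return (day, difference) where observed and predicted cumulatives differ most."""
    total = sum(v * v for v in values)
    k = len(values)
    running = 0.0
    worst_day, worst_diff = 0, 0.0
    for day, v in enumerate(values, start=1):
        running += v * v
        diff = abs(running / total - day / k)  # observed minus predicted
        if diff > worst_diff:
            worst_day, worst_diff = day, diff
    return worst_day, worst_diff


day, d_max = ks_against_equal_days(task_defs)

# Critical differences quoted in the text, keyed by significance level.
CRITICAL = {0.20: 0.058, 0.10: 0.066, 0.05: 0.073}
significant_at = [alpha for alpha, crit in CRITICAL.items() if d_max > crit]
print(f"maximum difference {d_max:.3f} at day {day}; exceeds: {significant_at}")
```

The largest gap falls on Day 4 and lies between the .20 and .10
critical differences, in line with the discussion above.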

If the Kolmogorov-Smirnov test had indicated a significant departure
at day 4, task definitions by day would again have to be computed
using days 5 through 15. In other words, an average of ten
correlations would represent the daily values, while the task
definition by matrix would be a mean of 110 correlations. The K-S test
would again be utilized in order to determine whether the
distributions of expected and observed squared task definitions were
similar.
TABLE 1: MEANS, STANDARD DEVIATIONS, TASK DEFINITIONS, FACTOR LOADINGS,
AND CUMULATIVE DISTRIBUTIONS FOR 15 DAYS

              STD    TASK        CUM     FACTOR   CUM     OBS CUM  OBS CUM  PRED CUM  DIFF
DAYS  MEANS   DEV    DEFINITION  DIST    LOADING  DIST    FREQ     FREQ     FREQ
                     (T)         (T)^2   (L)      (L)^2   (T)^2    (L)^2    (P)       (T)^2-P

1     8.5     3.3    .56         .32     .68      .46     .04      .04      .07       .03
2     9.9     4.3    .58         .65     .70      .95     .08      .09      .13       .05
3     9.8     4.3    .71         1.16    .85      1.67    .15      .15      .20       .05
4     11.1    4.8    .69         1.64    .83      2.37    .21      .21      .27       .06
5     11.6    4.6    .78         2.25    .93      3.23    .29      .29      .33       .04
6     12.4    4.9    .79         2.87    .93      4.10    .37      .37      .40       .03
7     13.3    5.0    .76         3.45    .90      4.91    .44      .44      .47       .03
8     13.4    4.8    .77         4.04    .92      5.75    .51      .51      .53       .02
9     13.1    5.4    .79         4.67    .94      6.63    .59      .59      .60       .01
10    13.4    4.5    .72         5.18    .86      7.37    .66      .66      .67       .01
11    14.7    5.3    .73         5.71    .87      8.13    .73      .73      .73       .00
12    14.0    6.0    .76         6.29    .90      8.94    .80      .80      .80       .00
13    14.3    4.5    .75         6.85    .90      9.74    .87      .87      .87       .00
14    14.1    5.8    .81         7.50    .96      10.66   .96      .95      .93       .03
15    15.5    4.7    .60         7.86    .72      11.19   1.00     1.00     1.00      .00